Abstract. Heat kernel pagerank is a variation of Personalized PageRank
given in an exponential formulation. In this work, we present a sublinear
time algorithm for approximating the heat kernel pagerank of a
graph. The algorithm works by simulating random walks of bounded
length and runs in time $O\left(\frac{\log(\epsilon^{-1})\log n}{\epsilon^{3}\log\log(\epsilon^{-1})}\right)$, assuming performing a random
walk step and sampling from a distribution with bounded support take
constant time.

The quantitative ranking of vertices obtained with heat kernel pagerank 
can be used for local clustering algorithms. We present an efficient local 
clustering algorithm that finds cuts by performing a sweep over a heat 
kernel pagerank vector, using the heat kernel pagerank approximation 
algorithm as a subroutine. Specifically, we show that for a subset $S$ of
Cheeger ratio $\phi$, many vertices in $S$ may serve as seeds for a heat kernel
pagerank vector which will find a cut of conductance $O(\sqrt{\phi})$.


Keywords: Heat kernel pagerank, heat kernel, local algorithms 

1 Introduction 

In large networks, many similar elements can be identified to a single, larger 
entity by the process of clustering. Increasing granularity in massive networks 
through clustering eases operations on the network. There is a large literature 
on the problem of identifying clusters in a graph ([8, 37, 34, 20, 28, 29]), and the
problem has found many applications. However, in a variation of the graph 
clustering problem we may only be interested in a single cluster near one element 
in the graph. For this, local clustering algorithms are of greater use. 

As an example, the problem of finding a local cluster arises in protein
networks. A protein-protein interaction (PPI) network has undirected edges that
represent an interaction between two proteins. Given two PPI networks, the goal
of the pairwise alignment problem is to identify an optimal mapping between
the networks that best represents a conserved biological function. In [27], a local
clustering algorithm is applied from a specified protein to identify a group
similar to that protein. Such local alignments are useful for analysis of a particular
component of a biological system (rather than at a systems level which will call
for a global alignment). Local clustering is also a common tool for identifying 
communities in a network. A community is loosely defined as a subset of vertices 
in a graph which are more strongly connected internally than to vertices outside 
the subset. Properties of community structure in large, real world networks have 
been studied in [28], for example, where local clustering algorithms are employed
for identifying communities of varying quality. 

The goal of a local clustering algorithm is to identify a cluster in a graph near 
a specified vertex. Using only local structure avoids unnecessary computation 
over the entire graph. An important consequence of this is that running times
are often in terms of the size of the small side of the partition, rather than of
the entire graph. The best performing local clustering algorithms use probability
diffusion processes over the graph to determine clusters (see Section 1.1). In this
paper we present a new algorithm which identifies a cut near a specified vertex 
with simple computations over a heat kernel pagerank vector. 

The theory behind using heat kernel pagerank for computing local clusters 
has been considered in previous work. Here we give an efficient approximation 
algorithm for computing heat kernel pagerank. Note that we use a "relaxed"
notion of approximation which allows us to derive a sublinear probabilistic
approximation algorithm for heat kernel pagerank, while computing an exact or
sharp approximation would require computation complexity of order similar to
matrix multiplication. We use this sublinear approximation algorithm for
efficient local clustering.

1.1 Previous work 

Heat kernel and approximation of matrix exponentials. Heat kernel pagerank 
was first introduced in [9] as a variant of personalized PageRank [13]. While 
PageRank can be viewed as a geometric sum of random walks, the heat kernel 
pagerank is an exponential sum of random walks. An alternative interpretation 
of the heat kernel pagerank is related to the heat kernel of a graph as the
fundamental solution to the heat equation. As such, it has connections with diffusion
and mixing properties of graphs and has been incorporated into a number of 
graph algorithmic primitives. 

Orecchia et al. use a variant of heat kernel random walks in their randomized
algorithm for computing a cut in a graph with prescribed balance
constraints [35]. A key subroutine in the algorithm is a procedure for computing
$e^{-A}v$ for a positive semidefinite matrix $A$ and a unit vector $v$ in time $\tilde{O}(m)$ for
graphs on $n$ vertices and $m$ edges. They show how this can be done with a small
number of computations of the form $A^{-1}u$ and applying the Spielman-Teng linear
solver [38]. Their main result is a randomized algorithm that outputs a balanced
cut in time $O(m\,\mathrm{polylog}\,n)$. In a follow up paper, Sachdeva and Vishnoi [36]
reduce inversion of positive semidefinite matrices to matrix exponentiation, thus
proving that matrix exponentiation and matrix inversion are equivalent to
polylog factors. In particular, the nearly-linear running time of the balanced separator
algorithm depends upon the nearly-linear time Spielman-Teng solver.

Another method for approximating matrix exponentials is given by Kloster
and Gleich in [22]. They use a Gauss-Southwell iteration to approximate the
Taylor series expansion of the column vector $e^{P}e_c$ for transition probability
matrix $P$ and $e_c$ a standard basis vector. The algorithm runs in sublinear time
assuming the maximum degree of the network is $O(\log\log n)$.

Local clustering. Local clustering algorithms were introduced in [38], where
Spielman and Teng present a nearly-linear time algorithm for finding local partitions
with certain balance constraints. Let $\Phi(S)$ denote the cut ratio of a subset $S$
that we will later define as the Cheeger ratio. Then, given a graph and a subset
of vertices $S$ such that $\Phi(S) \le \phi$ and $\mathrm{vol}(S) \le \mathrm{vol}(G)/2$, their algorithm finds
a set of vertices $T$ such that $\mathrm{vol}(T) \ge \mathrm{vol}(S)/2$ and $\Phi(T) \le O(\phi^{1/3}\log^{O(1)} n)$
in time $O(m(\log n/\phi)^{O(1)})$. This seminal work incorporates the ideas of Lovasz
and Simonovitz purer] on isoperimetric properties of random walks, and their 
algorithm works by simulating truncated random walks on the graph. Spielman 
and Teng later improve their approximation guarantee to $O(\phi^{1/2}\log^{3/2} n)$ in a
revised version of the paper [39].

The algorithm of [38, 39] improves on earlier spectral methods (see also a similar
expression in [1]), which use an eigenvector of the graph Laplacian to partition the
vertices of a graph. However, the local approach of Spielman and Teng allows
us to identify focused clusters without investigating the entire graph. For this
reason, the running times of this and similar local algorithms are proportional to
the size of the small side of the cut, rather than the entire graph.

Andersen et al. [3] give an improved local clustering algorithm using
approximate PageRank vectors. For a vertex subset $S$ with Cheeger ratio $\phi$ and volume
$k$, they show that a PageRank vector can be used to find a set with Cheeger ratio
$O(\phi^{1/2}\log^{1/2} k)$. Their local clustering algorithm runs in time $O(\phi^{-1} m \log^{4} m)$.
The analysis of the above process was strengthened in [2], which emphasized that
vertices with higher PageRank values will be on the same side of the cut as the
starting vertex.

Andersen and Peres [4] later simulate a volume-biased evolving set process
to find sparse cuts. Although their approximation guarantee is the same as that
of [3], their process yields a better ratio between the computational complexity
of the algorithm on a given run and the volume of the output set. They
call this value the work/volume ratio, and their evolving set algorithm achieves
an expected ratio of $O(\phi^{-1/2}\log^{3/2} n)$. This result is improved by Gharan and
Trevisan in [16] with an algorithm that finds a set of conductance at most
$O(\epsilon^{-1/2}\phi^{1/2})$ and achieves a work/volume ratio of $O(\varsigma^{\epsilon}\phi^{-1/2}\log^{2} n)$ for target
volume $\varsigma$ and target conductance $\phi$. The complexity of their algorithm is
achieved by running copies of an evolving set process in parallel.


1.2 Our contributions 

In this paper, we give a probabilistic approximation algorithm for computing a
vector that yields a ranking of vertices close to the heat kernel pagerank vector.
The approximation algorithm, ApproxHKPRseed, works by simulating random
walks and computing contributions of these walks for each vertex in the graph.
Assuming access to a constant-time query which returns the destination of a
heat kernel random walk starting from a specified vertex, ApproxHKPRseed runs
in time $O\left(\frac{\log(\epsilon^{-1})\log n}{\epsilon^{3}\log\log(\epsilon^{-1})}\right)$. In the context of this paper, we strictly address heat
kernel pagerank with a single vertex as a seed - an analogy to Personalized
PageRank with total preference given to a single vertex. Note that heat kernel
pagerank with a general preference vector (see Section 2) is a combination of
heat kernel pagerank vectors with single seed vertices. We refer the reader to [13] for
this more general case.

Using ApproxHKPRseed as a subroutine, we then present a local clustering
algorithm that uses a ranking according to an approximate heat kernel pagerank.
Let $G$ be a graph and $S$ a proper vertex subset with volume $\varsigma \le \mathrm{vol}(G)/4$
and Cheeger ratio $\Phi(S) \le \phi$. Then, with probability at least $1 - \epsilon$, our
algorithm outputs either a cutset $T$ with $\mathrm{vol}(T) \ge \mathrm{vol}(S)/2$ and $\varsigma$-local Cheeger
ratio at most $O(\sqrt{\phi})$ or a certificate that no such set exists. The algorithm has
work/volume ratio of $O(\varsigma^{-1}\epsilon^{-3}\log n\log(\epsilon^{-1})\log\log(\epsilon^{-1}))$. This result is
formalized in Theorem 4. A summary of previous results and our contributions is
given in Table 1.


Algorithm    Conductance of output set         Work/volume ratio
[39]         $O(\phi^{1/2}\log^{3/2} n)$       $O(\phi^{-2}\,\mathrm{polylog}\,n)$
[3]          $O(\phi^{1/2}\log^{1/2} n)$       $O(\phi^{-1}\,\mathrm{polylog}\,n)$
[4]          $O(\phi^{1/2}\log^{1/2} n)$       $O(\phi^{-1/2}\,\mathrm{polylog}\,n)$
[16]         $O(\epsilon^{-1/2}\phi^{1/2})$    $O(\varsigma^{\epsilon}\phi^{-1/2}\,\mathrm{polylog}\,n)$
This work    $O(\phi^{1/2})$                   $O(\varsigma^{-1}\epsilon^{-3}\log n\log(\epsilon^{-1})\log\log(\epsilon^{-1}))$

Table 1: Summary of local clustering algorithms


As a summary of the contributions of this work,

(1) We present an algorithm for computing a heat kernel pagerank vector from a
single seed vertex with $(1 + \epsilon)$ approximation guarantee with high probability
in time $O\left(\frac{\log(\epsilon^{-1})\log n}{\epsilon^{3}\log\log(\epsilon^{-1})}\right)$.

(2) We present a local clustering algorithm which uses a ranking according to
heat kernel pagerank. In our clustering algorithm we use the probabilistic
approximation algorithm in (1) as a subroutine, which gives a sublinear-time
local clustering algorithm.

(3) Using the approximation guarantees of (1) and the analysis for (2), we
present a local clustering algorithm which with high probability returns a set
with Cheeger ratio at most $O(\sqrt{\phi})$, given a target ratio $\phi$, with work/volume
ratio $O(\varsigma^{-1}\epsilon^{-3}\log n\log(\epsilon^{-1})\log\log(\epsilon^{-1}))$, where $\varsigma$ is proportional to the
volume of the output set.

(4) We validate the performance analysis by implementing our algorithms using
several real and synthetic graphs as examples. The clusters that were
derived in these examples using the local clustering algorithm and heat kernel
pagerank approximation have Cheeger ratios as guaranteed in Theorem 4.

The theory behind finding local cuts with heat kernel pagerank vectors was
first presented in earlier work. Using some of this analysis as a starting point, we
provide the algorithm for computing local clusters, called ClusterHKPR.


1.3 Organization 


The remainder of the paper is organized as follows. First, we give some definitions
and useful facts in Section 2. We give a sublinear-time algorithm for
approximating heat kernel pagerank in Section 3. In Section 4 we give the analysis justifying
our local clustering algorithm, which we present in Section 4.1. Sections 5 and 6
contain experimental results. In both sections, experiments are performed on
real data and on synthetic graphs generated with random graph generators. In
Section 5 we demonstrate how the rankings obtained using approximate heat
kernel pagerank vectors compare with rankings obtained using exact heat
kernel pagerank vectors. In Section 6 we compute local clusters by implementing
the ClusterHKPR algorithm. We compare the volume and Cheeger ratio of these
clusters to those output by two existing sweep-based local clustering algorithms.
The first is by a sweep of an exact heat kernel pagerank vector, to compare the
effects of heat kernel pagerank computation, and the second by a PageRank
vector [3]. PageRank has a similar expression to heat kernel pagerank except
PageRank is a geometric sum whereas heat kernel pagerank can be viewed as
an exponential sum. We expect better convergence rates from heat kernel (see
Section 2.2).


2 Preliminaries 

Let $G = (V, E)$ be an undirected graph on $n$ vertices and $m$ edges. We use $u \sim v$
to denote $\{u, v\} \in E$. The degree, $d_v$, of a vertex $v$ is the number of vertices $u$
such that $u \sim v$. The volume of a set of vertices $S \subseteq V$ is the total degree of its
vertices, $\mathrm{vol}(S) = \sum_{v \in S} d_v$, and the edge boundary of $S$ is the set of edges with
one vertex in $S$ and the other outside of $S$, $\partial(S) = \{u \sim v : u \in S, v \notin S\}$.
When discussing the full vertex set, $V$, we write $S \subseteq G$ and $\mathrm{vol}(G) = \mathrm{vol}(V)$.

Let $f \in \mathbb{R}^{n}$ be a row vector over the vertices of $G$. Then the support of $f$ is
the set of vertices with nonzero values in $f$, $\mathrm{supp}(f) = \{u \in V : f(u) \ne 0\}$. For
a subset of vertices $S$, we define $f(S) = \sum_{u \in S} f(u)$.


2.1 A local Cheeger inequality 

The quality of a cut can be measured by the ratio of the number of edges between
the two parts of the cut and the volume of the smaller side of the cut. This is
called the Cheeger ratio of a set, defined by

$$\Phi(S) = \frac{|\partial(S)|}{\min(\mathrm{vol}(S), \mathrm{vol}(V \setminus S))}.$$

The Cheeger constant of a graph is the minimal Cheeger ratio,

$$\Phi(G) = \min_{S \subset G} \Phi(S).$$

Finally, for a given subset $S$ of a graph $G$, the local Cheeger ratio is defined

$$\Phi^{*}(S) = \min_{T \subseteq S} \Phi(T).$$
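The three definitions above can be made concrete on a toy graph. The following is a minimal Python sketch, not part of the paper: the adjacency-list representation, the function names, and the example graph (two triangles joined by one edge) are all assumptions chosen for illustration, and the local Cheeger ratio is found by brute force over subsets, which is only feasible for very small sets.

```python
from itertools import combinations

def volume(adj, S):
    """vol(S): total degree of the vertices in S."""
    return sum(len(adj[v]) for v in S)

def edge_boundary(adj, S):
    """Edges with one endpoint in S and the other outside S."""
    S = set(S)
    return {(u, v) for u in S for v in adj[u] if v not in S}

def cheeger_ratio(adj, S):
    """Phi(S) = |boundary(S)| / min(vol(S), vol(V \\ S)); S must be a proper nonempty subset."""
    vol_s = volume(adj, S)
    vol_rest = volume(adj, adj.keys()) - vol_s
    return len(edge_boundary(adj, S)) / min(vol_s, vol_rest)

def local_cheeger_ratio(adj, S):
    """Phi*(S): minimum Cheeger ratio over nonempty subsets T of S (brute force)."""
    S = sorted(S)
    return min(cheeger_ratio(adj, T)
               for k in range(1, len(S) + 1)
               for T in combinations(S, k))

# Illustrative graph (an assumption): triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
S = {0, 1, 2}
phi_S = cheeger_ratio(adj, S)             # one boundary edge, volume 7
phi_star_S = local_cheeger_ratio(adj, S)  # the whole triangle is the best subset
```

Here the triangle has a single boundary edge and volume 7, so both quantities equal 1/7.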

Our local clustering algorithm is derived from a local version of the usual
Cheeger inequalities which relate the Cheeger constant of a graph to an
eigenvalue associated to the graph. Namely, let the normalized Laplacian of a graph
be the matrix $\mathcal{L} = D^{-1/2}(D - A)D^{-1/2}$, where $D$ is the diagonal matrix of vertex
degrees and $A$ is the unweighted, symmetric adjacency matrix. Also, let $\mathcal{L}_S$ be
determined by a subset $S$ of size $|S| = s$ and define $\mathcal{L}_S = D_S^{-1/2}(D_S - A_S)D_S^{-1/2}$,
where $D_S$ and $A_S$ are the restricted matrices of $D$ and $A$ with rows and columns
indexed by vertices in $S$. Then the eigenvalues $\lambda_S := \lambda_{S,1} \le \lambda_{S,2} \le \cdots \le \lambda_{S,s}$
of $\mathcal{L}_S$ are also known as the Dirichlet eigenvalues of $S$, and are related to $\Phi^{*}(S)$
by the following local Cheeger inequality [10]:

$$\frac{1}{2}(\Phi^{*}(S))^{2} \le \lambda_S \le \Phi^{*}(S). \qquad (1)$$

The inequality (1) will be used to derive a relationship between a ranking
according to heat kernel pagerank and sets with good Cheeger ratios. Details
will be given in Section 4.


2.2 Heat kernel and heat kernel pagerank 

The heat kernel pagerank vector has entries indexed by the vertices of the graph
and involves two parameters: a non-negative real value $t$, representing the
temperature, and a preference row vector $f : V \to \mathbb{R}$. It is defined by the following equation:

$$\rho_{t,f} = \sum_{k=0}^{\infty} e^{-t}\frac{t^{k}}{k!} fP^{k}, \qquad (2)$$

where $P$ is the transition probability matrix

$$(P)_{uv} = \begin{cases} 1/d_u & \text{if } u \sim v, \\ 0 & \text{otherwise.} \end{cases}$$

When $f$ is a probability distribution, the heat kernel pagerank can be
regarded as the expected distribution of a random walk according to the transition
probability matrix $P$. A starting distribution we will be particularly concerned
with is that with all probability initially on a single vertex $u$, i.e. $f = \chi_u$ where
$\chi_u$ is the indicator vector for vertex $u$. We will denote the heat kernel pagerank
vector over this distribution by $\rho_{t,u} := \rho_{t,\chi_u}$ and refer to $u$ as the seed vertex.

The heat kernel of a graph is defined $H_t = e^{-t\Delta}$ where $\Delta$ is the Laplace
operator $\Delta = I - P$. Then an alternative definition for heat kernel pagerank is
$\rho_{t,f} = fH_t$, and we have that heat kernel pagerank satisfies the heat differential
equation

$$\frac{\partial}{\partial t}\rho_{t,f} = -\rho_{t,f}(I - P). \qquad (3)$$

We can compare the heat kernel pagerank to the personalized PageRank
vector, given by

$$\mathrm{pr}_{\alpha,f} = \alpha\sum_{k=0}^{\infty}(1 - \alpha)^{k} fP^{k}. \qquad (4)$$

In this definition, $\alpha$ is often called the jumping or reset constant, meaning that
at any step the random walk may jump to a vertex taken from $f$ with probability
$\alpha$. When $f = \chi_u$ for some $u$, i.e. preference is given to a single vertex, the random
walk is "reset" to the first vertex of the walk, $u$, with probability $\alpha$. We note
that, compared to the personalized PageRank vector, which can be viewed as
a geometric sum, we can expect better convergence rates from the heat kernel
pagerank, defined as an exponential sum.
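To make the two series concrete, here is a minimal Python sketch that evaluates truncations of (2) and (4) on a toy graph. It is an illustration only: the dictionary-based vectors, the function names, the truncation lengths, and the example graph are assumptions, and truncating the infinite sums is a numerical convenience rather than the approximation algorithm of Section 3.

```python
import math

def walk_step(adj, f):
    """Return the row vector f P, where P[u][v] = 1/d_u if u ~ v and 0 otherwise."""
    out = dict.fromkeys(adj, 0.0)
    for u, mass in f.items():
        for v in adj[u]:
            out[v] += mass / len(adj[u])
    return out

def heat_kernel_pagerank(adj, t, f, terms=200):
    """Truncation of rho_{t,f} = sum_k e^{-t} t^k / k! * f P^k after `terms` terms."""
    rho = dict.fromkeys(adj, 0.0)
    dist = {v: float(f.get(v, 0.0)) for v in adj}   # f P^k, starting at k = 0
    weight = math.exp(-t)                           # e^{-t} t^k / k!, starting at k = 0
    for k in range(terms):
        for v in adj:
            rho[v] += weight * dist[v]
        dist = walk_step(adj, dist)
        weight *= t / (k + 1)
    return rho

def personalized_pagerank(adj, alpha, f, terms=2000):
    """Truncation of pr_{alpha,f} = alpha * sum_k (1 - alpha)^k * f P^k."""
    pr = dict.fromkeys(adj, 0.0)
    dist = {v: float(f.get(v, 0.0)) for v in adj}
    weight = alpha                                  # alpha (1 - alpha)^k, starting at k = 0
    for _ in range(terms):
        for v in adj:
            pr[v] += weight * dist[v]
        dist = walk_step(adj, dist)
        weight *= 1.0 - alpha
    return pr

# Illustrative graph (an assumption): two triangles joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
chi_0 = {0: 1.0}                                    # all preference on seed vertex 0
rho = heat_kernel_pagerank(adj, t=2.0, f=chi_0)
pr = personalized_pagerank(adj, alpha=0.15, f=chi_0)
```

Since both weight sequences sum to 1 and $P$ preserves total mass, each vector is a probability distribution whenever $f$ is one. The Poisson weights decay factorially in $k$ while the PageRank weights decay geometrically, which is the sense in which the heat kernel series needs fewer terms.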


3 Heat Kernel Pagerank Approximation 

We begin our discussion of heat kernel pagerank approximation with an
observation. Each term in the infinite series defining heat kernel pagerank in (2) is
of the form $e^{-t}\frac{t^{k}}{k!}fP^{k}$ for $k \in [0, \infty)$. The vector $fP^{k}$ is the distribution after $k$
random walk steps with starting distribution $f$. Then, if we perform $k$ steps of a
random walk given by transition probability matrix $P$ from starting distribution
$f$ with probability $p_k = e^{-t}\frac{t^{k}}{k!}$, the heat kernel pagerank vector can be viewed
as the expected distribution of this process.

This suggests a natural way to approximate the heat kernel pagerank. That is, 
we can obtain a close approximation to the expected distribution with sufficiently 
many samples. Our algorithm operates as follows. We perform r random walks 
to approximate the infinite sum, choosing r large enough to bound the error. We 
also use the fact that very long walks are performed with small probability. As 
such, we limit the lengths of our random walks by a finite number $K$. Both $r$ and $K$
depend on a predetermined error bound $\epsilon$.

In our analysis we will use the following definition of an e-approximate vector. 

Definition 1. Let $G$ be a graph on $n$ vertices, and let $f : V \to \mathbb{R}$ be a vector
over the vertices of $G$. Let $\rho_{t,f}$ be the heat kernel pagerank vector according to $f$
and $t$. Then we say that $\nu \in \mathbb{R}^{n}$ is an $\epsilon$-approximate vector of $\rho_{t,f}$ if

1. for every vertex $v \in V$ in the support of $\nu$,

$$(1 - \epsilon)\rho_{t,f}(v) - \epsilon \le \nu(v) \le (1 + \epsilon)\rho_{t,f}(v),$$

2. for every vertex with $\nu(v) = 0$, it must be that $\rho_{t,f}(v) \le \epsilon$.

We note that this is a rather coarse requirement for an approximation, but
satisfies our needs for local clustering. In the following algorithm, we
approximate $\rho_{t,u}$ by an $\epsilon$-approximate vector which we denote by $\hat{\rho}_{t,u}$. The running
time of the algorithm is $O\left(\frac{\log(\epsilon^{-1})\log n}{\epsilon^{3}\log\log(\epsilon^{-1})}\right)$. The method and complexity of the
algorithm, ApproxHKPRseed, are similar to the ApproxRow algorithm for
personalized PageRank given in [7].


ApproxHKPRseed($G$, $t$, $u$, $\epsilon$)

input: a graph $G$, $t \in \mathbb{R}^{+}$, seed vertex $u \in V$, error parameter $0 < \epsilon < 1$.
output: $\hat{\rho}$, an $\epsilon$-approximation of $\rho_{t,u}$.

initialize a 0-vector $\hat{\rho}$ of dimension $n$, where $n = |V|$
$r \leftarrow \frac{16}{\epsilon^{3}}\log n$
$K \leftarrow c \cdot \frac{\log(\epsilon^{-1})}{\log\log(\epsilon^{-1})}$ for some choice of constant $c$
for $r$ iterations do
    simulate a $P$ random walk from vertex $u$ where $k$ steps are taken with
        probability $e^{-t}\frac{t^{k}}{k!}$ and $k \le K$
    let $v$ be the last vertex visited in the walk
    $\hat{\rho}[v] \leftarrow \hat{\rho}[v] + 1$
end for
return $\frac{1}{r} \cdot \hat{\rho}$
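The pseudocode above translates fairly directly into a short program. The following Python sketch is one possible reading, under stated assumptions: the graph is an adjacency-list dictionary, the walk length is drawn by inversion sampling with all probability mass beyond $K$ assigned to $K$ (one way to enforce $k \le K$), the default constant `c = 4.0` is arbitrary, and for $\epsilon \ge 1/e$, where $\log\log(\epsilon^{-1})$ is not positive, the code falls back to $K = \lceil c \rceil$. None of these choices is prescribed by the paper.

```python
import math
import random

def sample_walk_length(t, K, rng):
    """Draw k with probability e^{-t} t^k / k!, capped at K (inversion sampling)."""
    u = rng.random()
    k, p = 0, math.exp(-t)
    cdf = p
    while u > cdf and k < K:
        k += 1
        p *= t / k
        cdf += p
    return k

def approx_hkpr_seed(adj, t, seed, eps, c=4.0, K=None, rng=None):
    """Sketch of ApproxHKPRseed: estimate rho_{t,seed} as a dict over visited vertices."""
    rng = rng or random.Random()
    n = len(adj)
    r = math.ceil(16.0 / eps**3 * math.log(n))       # number of sampled walks
    if K is None:
        log_inv = math.log(1.0 / eps)
        # Assumption: fall back to ceil(c) when log log(1/eps) is not positive.
        K = math.ceil(c * log_inv / math.log(log_inv)) if log_inv > 1.0 else math.ceil(c)
    counts = {}
    for _ in range(r):
        v = seed
        for _ in range(sample_walk_length(t, K, rng)):
            v = rng.choice(adj[v])                    # one step of the P random walk
        counts[v] = counts.get(v, 0) + 1              # record the last vertex visited
    return {v: cnt / r for v, cnt in counts.items()}

# Illustrative graph (an assumption): two triangles joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
rho_hat = approx_hkpr_seed(adj, t=2.0, seed=0, eps=0.2, rng=random.Random(0))
```

The returned dictionary only contains vertices that were reached by some walk, which mirrors the sparse support of the approximate vector. Each of the `r` iterations takes at most `K` steps, so the work is bounded by `r * K` regardless of the size of the graph.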


Theorem 1. Let $G$ be a graph and let $u$ be a vertex of $G$. Then, the algorithm
ApproxHKPRseed($G$, $t$, $u$, $\epsilon$) outputs an $\epsilon$-approximate vector $\hat{\rho}_{t,u}$ of the heat
kernel pagerank $\rho_{t,u}$ for $0 < \epsilon < 1$ with probability at least $1 - \epsilon$. The running time
of ApproxHKPRseed is $O\left(\frac{\log(\epsilon^{-1})\log n}{\epsilon^{3}\log\log(\epsilon^{-1})}\right)$.

3.1 Analysis of the heat kernel pagerank algorithm 

Our analysis relies on the usual Chernoff bounds as stated below. 

Lemma 1 ([7]). Let $X_i$ be independent Bernoulli random variables with
$X = \sum_{i=1}^{r} X_i$. Then,

1. for $0 < \epsilon \le 1$, $\mathbb{P}(X < (1 - \epsilon)r\mathbb{E}(X)) < \exp(-\frac{\epsilon^{2}}{2}r\mathbb{E}(X))$

2. for $0 < \epsilon \le 1$, $\mathbb{P}(X > (1 + \epsilon)r\mathbb{E}(X)) < \exp(-\frac{\epsilon^{2}}{4}r\mathbb{E}(X))$

3. for $c \ge 1$, $\mathbb{P}(X > (1 + c)r\mathbb{E}(X)) < \exp(-\frac{c}{2}r\mathbb{E}(X))$.

Proof (Theorem 1). Consider the random variable which takes on value $fP^{k}$
with probability $p_k = e^{-t}\frac{t^{k}}{k!}$ for $k \in [0, \infty)$. The expectation of this random
variable is exactly $\rho_{t,f}$. Heat kernel pagerank can be understood as a series of
distributions of weighted random walks over the vertices, and the weights are
related to the number of steps taken in the walk. The series can be computed
by simulating this process, i.e., draw $k$ according to $p_k$ and compute $fP^{k}$ with
sufficiently many random walks of length $k$.

We approximate the infinite sum by limiting the walks to at most $K$ steps. We
will take $K$ to be $K = \frac{\log(\epsilon^{-1})}{\log\log(\epsilon^{-1})}$. These interrupts risk the loss of contribution
to the expected value, but can be upper bounded by $e^{-t}\frac{t^{K}}{K!} \le \frac{\epsilon}{2}$ provided that
$t \ge K/\log K$. This is within the error bound for an approximate heat kernel
pagerank. If $t < K/\log K$, the expected length of the random walk is

$$\sum_{k=0}^{\infty} e^{-t}\frac{t^{k}}{k!}\,k = t < K/\log K.$$

Thus we can ignore walks of length more than $K$ while maintaining $\rho_{t,u}(v) - \epsilon \le
\hat{\rho}_{t,u}(v) \le \rho_{t,u}(v)$ for every vertex $v$.

Next we show how many samples are necessary for our approximation vectors.
For $k \le K$, our algorithm simulates $k$ random walk steps with probability $e^{-t}\frac{t^{k}}{k!}$.
To be specific, for a fixed $u$, let $X_k^{v}$ be the indicator random variable defined by
$X_k^{v} = 1$ if a random walk beginning from vertex $u$ ends at vertex $v$ in $k$ steps. Let
$X^{v}$ be the random variable that considers the random walk process ending at
vertex $v$ in at most $k$ steps. That is, $X^{v}$ assumes the vector $X_k^{v}$ with probability
$e^{-t}\frac{t^{k}}{k!}$. Namely, we consider the combined random walk

$$X^{v} = \sum_{k \le K} e^{-t}\frac{t^{k}}{k!}X_k^{v}.$$

Now, let $\rho(k)_{t,u}$ be the contribution to the heat kernel pagerank vector $\rho_{t,u}$
of walks of length at most $k$. The expectation of each $X^{v}$ is $\rho(k)_{t,u}(v)$. Then, by
Lemma 1,

$$\mathbb{P}(X^{v} < (1 - \epsilon)\rho(k)_{t,u}(v) \cdot r) < \exp(-\rho(k)_{t,u}(v)\,r\epsilon^{2}/2)$$
$$= \exp(-(8/\epsilon)\rho(k)_{t,u}(v)\log n)$$
$$< n^{-4}$$

for every component with $\rho_{t,u}(v) > \epsilon$, since then $\rho(k)_{t,u}(v) > \epsilon/2$. Similarly,

$$\mathbb{P}(X^{v} > (1 + \epsilon)\rho(k)_{t,u}(v) \cdot r) < \exp(-\rho(k)_{t,u}(v)\,r\epsilon^{2}/4)$$
$$= \exp(-(4/\epsilon)\rho(k)_{t,u}(v)\log n)$$
$$< n^{-2}.$$

We conclude the analysis for the support of $\rho_{t,u}$ by noting that $\hat{\rho}_{t,u}(v) = \frac{1}{r}X^{v}$, and
we achieve an $\epsilon$-multiplicative error bound for every vertex $v$ with $\rho_{t,u}(v) > \epsilon$
with probability at least $1 - O(n^{-2})$.

On the other hand, if $\rho_{t,u}(v) \le \epsilon$, by the third part of Lemma 1, $\mathbb{P}(\hat{\rho}_{t,u}(v) >
2\epsilon) \le n^{-8/\epsilon}$. We may conclude that, with high probability, $\hat{\rho}_{t,u}(v) \le 2\epsilon$.

For the running time, we use the assumptions that performing a random walk
step and drawing from a distribution with bounded support require constant
time. These are incorporated in the random walk simulation, which dominates
the computation. Therefore, for each of the $r$ rounds, at most $K$ steps of the
random walk are simulated, giving a total of $rK = \frac{16}{\epsilon^{3}}\log n \cdot \frac{\log(\epsilon^{-1})}{\log\log(\epsilon^{-1})} =
O\left(\frac{\log(\epsilon^{-1})\log n}{\epsilon^{3}\log\log(\epsilon^{-1})}\right)$ queries. □
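The constants in the proof can be checked numerically. The Python sketch below is an illustration, with function names and the sample values of $n$ and $\epsilon$ chosen here as assumptions: it evaluates $r = \frac{16}{\epsilon^{3}}\log n$ and $K = \frac{\log(\epsilon^{-1})}{\log\log(\epsilon^{-1})}$, and confirms that plugging $\rho(k)_{t,u}(v) = \epsilon/2$ into the two Chernoff exponents gives exactly $n^{-4}$ and $n^{-2}$.

```python
import math

def sample_count(n, eps):
    """r = (16 / eps^3) log n, the number of simulated walks."""
    return 16.0 / eps**3 * math.log(n)

def walk_length_cap(eps, c=1.0):
    """K = c log(1/eps) / log log(1/eps); the formula needs eps < 1/e."""
    log_inv = math.log(1.0 / eps)
    if log_inv <= 1.0:
        raise ValueError("the formula for K assumes eps < 1/e")
    return c * log_inv / math.log(log_inv)

def lower_tail_bound(p, n, eps):
    """exp(-p r eps^2 / 2): first Chernoff bound with expectation p per sample."""
    return math.exp(-p * sample_count(n, eps) * eps**2 / 2)

def upper_tail_bound(p, n, eps):
    """exp(-p r eps^2 / 4): second Chernoff bound with expectation p per sample."""
    return math.exp(-p * sample_count(n, eps) * eps**2 / 4)

n, eps = 10**6, 0.1                      # illustrative values (assumptions)
r, K = sample_count(n, eps), walk_length_cap(eps)
total_steps = r * K                      # bound on random walk steps, independent of m
low = lower_tail_bound(eps / 2, n, eps)  # equals n^{-4} up to rounding
high = upper_tail_bound(eps / 2, n, eps) # equals n^{-2} up to rounding
```

The number of steps grows only logarithmically in $n$, but the $\epsilon^{-3}$ factor dominates in practice: halving $\epsilon$ multiplies the number of walks by eight.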

Remark 1. This bound on $K$ is not tight. However, it is enough to use $cK$
for some small constant $c$ to cluster vertices with $\epsilon$-approximate heat kernel
pagerank vectors computed with bounded random walks. Regardless, this value
$O(K)$ is independent of the size of the graph and never affects the running time.
See Section 5 for a further discussion.

Remark 2. We note that the algorithm works for any $t$, but a good choice of
$t$ will be related to the size of the local cluster $S$ and a desirable convergence
rate. In particular, the constraints put on $t$ are necessary for our local clustering
results, presented in Section 4.

The algorithm for efficient heat kernel pagerank computation has promise
for a variety of applications. It has been shown, for example, how to apply heat
kernel pagerank in solving symmetric diagonally dominant linear systems with a
boundary condition.

4 Finding Good Local Cuts 

The premise of the local clustering algorithm is to find a good cut near a specified
vertex by performing a sweep over a vector associated to that vertex, which we
will specify. Let $p : V \to \mathbb{R}$ be a probability distribution vector over the vertices
of the graph of support size $N_p = |\mathrm{supp}(p)|$. Then, consider a probability-per-degree
ordering of the vertices where $p(v_1)/d_{v_1} \ge p(v_2)/d_{v_2} \ge \cdots \ge p(v_{N_p})/d_{v_{N_p}}$. Let
$S_i$ be the set of the first $i$ vertices per the ordering. We call each $S_i$ a segment.
Then the process of investigating the cuts induced by the segments to find an
optimal cut is called performing a sweep over $p$.
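A sweep as described above can be sketched in a few lines of Python. This is an illustrative implementation under assumptions: the graph is a simple undirected graph stored as an adjacency-list dictionary, the vector `p` is a dictionary over its support, and the function name `sweep`, the optional `max_volume` cutoff, and the example vector are all hypothetical. The boundary size is updated incrementally as each vertex joins the segment, so the whole sweep costs time proportional to the volume examined.

```python
def sweep(adj, p, max_volume=None):
    """Return (best Cheeger ratio, best segment) over the segments S_1, S_2, ... of p."""
    total = sum(len(nb) for nb in adj.values())           # vol(G)
    order = sorted((v for v in p if p[v] > 0),
                   key=lambda v: p[v] / len(adj[v]), reverse=True)
    in_segment, vol, cut = set(), 0, 0
    best = (float("inf"), [])
    for i, v in enumerate(order):
        # Edges from v into the segment stop being boundary edges;
        # the remaining edges of v become boundary edges.
        inside = sum(1 for w in adj[v] if w in in_segment)
        cut += len(adj[v]) - 2 * inside
        vol += len(adj[v])
        in_segment.add(v)
        if max_volume is not None and vol > max_volume:
            break                                         # locality: stop past the target size
        smaller_side = min(vol, total - vol)
        if smaller_side == 0:
            break                                         # the segment is the whole graph
        ratio = cut / smaller_side
        if ratio < best[0]:
            best = (ratio, order[:i + 1])
    return best

# Illustrative graph and vector (assumptions): two triangles joined by the edge 2-3,
# with most of the probability concentrated on the first triangle.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
p = {0: 0.31, 1: 0.29, 2: 0.25, 3: 0.10, 4: 0.03, 5: 0.02}
best_ratio, best_segment = sweep(adj, p)
```

On this example the ordering is 0, 1, 2, 3, 4, 5 and the segment {0, 1, 2} attains the minimum ratio 1/7, since it has one boundary edge and volume 7.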

In this section we will show how a sweep over a single heat kernel pagerank
vector finds local cuts. Specifically, we show that for a subset $S$ with $\mathrm{vol}(S) \le
\mathrm{vol}(G)/4$ and $\Phi(S) \le \phi$, and for a large number of vertices $u \in S$, performing a
sweep over the vector $\hat{\rho}_{t,u}$, where $\hat{\rho}_{t,u}$ is an $\epsilon$-approximation of $\rho_{t,u}$, will find a
set with Cheeger ratio at most $O(\sqrt{\phi})$.

Remark 3. Though all the vertices in the support of the vector are sorted to build
segments, in practice the sweep will be aborted after the volume of the current
segment is larger than the target size. This is the locality of the algorithm, and
ensures that the amount of work performed is proportional to the volume of the
output set.

The $\varsigma$-local Cheeger ratio of a sweep over a vector $\nu$ is the minimum Cheeger
ratio over segments $S_i$ with volume $0 \le \mathrm{vol}(S_i) \le 2\varsigma$. Let $\Phi_{\varsigma}(\nu)$ be the $\varsigma$-local
Cheeger ratio of cuts over a sweep of $\nu$ that separates sets of volume between 0
and $2\varsigma$.

We will use the following bounds for heat kernel pagerank in terms of local 
Cheeger ratios and sweep cuts to reason that many vertices u can serve as good 
seeds for performing a sweep. 

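Lemma 2. Let $G$ be a graph and $S$ a subset of vertices of volume $\varsigma \le \mathrm{vol}(G)/4$.
Then the set of $u \in S$ satisfying

$$\frac{1}{2}e^{-t\Phi^{*}(S)} \le \rho_{t,u}(S) \le \sqrt{\varsigma}\,e^{-t\Phi_{\varsigma}(\rho_{t,u})^{2}/4}$$

has volume at least $\varsigma/2$.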

To prove Lemma 2, we begin with some bounds for heat kernel pagerank in
terms of local Cheeger ratios and sweep cuts. For a subset $S$, define $f_S$ to be the
following distribution over the vertices,

$$f_S(u) = \begin{cases} d_u/\mathrm{vol}(S) & \text{if } u \in S, \\ 0 & \text{otherwise.} \end{cases}$$


Then the expected value of $\rho_{t,u}(S)$ over $u$ in $S$ is given by:

$$\mathbb{E}(\rho_{t,u}(S)) = \sum_{u \in S}\frac{d_u}{\mathrm{vol}(S)}\rho_{t,u}(S)
= \sum_{u \in S}\frac{d_u}{\mathrm{vol}(S)}(\chi_u H_t)(S)
= f_S H_t(S)
= \rho_{t,f_S}(S). \qquad (5)$$


We will make use of the following result, given here without proof (see [10]),
which bounds the expected value of $\rho_{t,u}(S)$ given by (5) in terms of local Cheeger
ratios.

Lemma 3 ([10]). In a graph $G$, and for a subset $S$, the following holds:

$$\frac{1}{2}e^{-t\Phi^{*}(S)} \le \rho_{t,f_S}(S).$$

Next, we use an upper bound on the amount of probability remaining in $S$
after sufficient mixing. This is an extension of a theorem given in previous work.

Theorem 2. Let $G$ be a graph and $S$ a subset of vertices with volume $\varsigma \le
\mathrm{vol}(G)/4$. Then,

$$\rho_{t,f_S}(S) \le \sqrt{\varsigma}\,e^{-t\Phi_{\varsigma}(\rho_{t,f_S})^{2}/4}.$$


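To prove Theorem 2, we define the following for an arbitrary function $f :
V \to \mathbb{R}$ and any integer $x$ with $0 \le x \le \mathrm{vol}(G)/2$,

$$f(x) = \max_{T \subseteq V \times V,\ |T| = x}\ \sum_{(u,v) \in T} f(u,v), \qquad
f(u,v) = \begin{cases} f(u)/d_u, & \text{if } u \sim v, \\ 0, & \text{otherwise.} \end{cases}$$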

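The above definition can be extended to all real values of $x$,

$$f(x) = \max\ \sum_{(u,v) \in V \times V} \alpha_{uv} f(u,v),$$

where the maximum is taken over weights $0 \le \alpha_{uv} \le 1$ with $\sum_{(u,v)} \alpha_{uv} = x$.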

Claim. Let $S_i$ be a segment according to a vector $f : V \to \mathbb{R}$ such that $x =
\mathrm{vol}(S_i)$ and $f(v) \ge 0$ for every $v \in S_i$. Then

$$f(x) = \sum_{u \in S_i} f(u) = f(S_i).$$

Proof. We are considering the maximum over a subset of vertex pairs $T$ of size
$\mathrm{vol}(S_i)$. Since we are only adding values over vertex pairs which are edges in $G$,
this maximum is achieved when

$$f(x) = \sum_{u \in S_i}\sum_{v \sim u} f(u)/d_u
= \sum_{u \in S_i} f(u)\sum_{v \sim u} 1/d_u
= f(S_i).$$

□


Proof (Theorem 2). Let $Z$ be the lazy random walk $Z = \frac{1}{2}(I + P)$. Then,

$$fZ(S) = \frac{1}{2}\Big(f(S) + \sum_{u \sim v \in S} f(u,v)\Big)$$
$$= \frac{1}{2}\Big(\sum_{u \vee v \in S} f(u,v) + \sum_{u \wedge v \in S} f(u,v)\Big)$$
$$\le \frac{1}{2}\big(f(\mathrm{vol}(S) + |\partial(S)|) + f(\mathrm{vol}(S) - |\partial(S)|)\big)$$
$$= \frac{1}{2}\big(f(\mathrm{vol}(S)(1 + \Phi(S))) + f(\mathrm{vol}(S)(1 - \Phi(S)))\big).$$

Let $f_t = \rho_{t,f_S}$ and let $x$ satisfy $0 \le x \le 2\varsigma \le \mathrm{vol}(G)/2$ and represent
a volume of some set $S_i$. Then taking cue from the above inequality, we can
associate $S$ to $S_i$, $\mathrm{vol}(S)$ to $\mathrm{vol}(S_i) = x$ and $\Phi(S)$ to the minimum Cheeger ratio
of a set $S_i$ satisfying $\mathrm{vol}(S_i) = x \le 2\varsigma$, or $\Phi_{\varsigma}(\rho_{t,f_S})$. Then using the Claim,

$$f_t Z(x) \le \frac{1}{2}\big(f_t(x(1 + \Phi_{\varsigma}(\rho_{t,f_S}))) + f_t(x(1 - \Phi_{\varsigma}(\rho_{t,f_S})))\big).$$

Now consider the following differential inequality,

$$\frac{\partial}{\partial t}f_t(x) = -\rho_{t,f_S}(I - P)(x) \qquad (6)$$
$$= -2\rho_{t,f_S}(I - Z)(x)$$
$$= -2f_t(x) + 2f_t Z(x)$$
$$\le -2f_t(x) + f_t(x(1 + \Phi_{\varsigma}(\rho_{t,f_S}))) + f_t(x(1 - \Phi_{\varsigma}(\rho_{t,f_S}))) \qquad (7)$$
$$\le 0. \qquad (8)$$

Line (6) follows from (3) and line (8) follows from the concavity of $f$.

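Consider $g_t(x)$ to be $g_t(x) = \sqrt{x}\,e^{-t\Phi_{\varsigma}(\rho_{t,f_S})^{2}/4}$. Then,

$$-2g_t(x) + g_t(x(1 + \Phi_{\varsigma}(\rho_{t,f_S}))) + g_t(x(1 - \Phi_{\varsigma}(\rho_{t,f_S})))$$
$$= -2g_t(x) + \sqrt{1 + \Phi_{\varsigma}(\rho_{t,f_S})}\,g_t(x) + \sqrt{1 - \Phi_{\varsigma}(\rho_{t,f_S})}\,g_t(x)$$
$$= \Big(-2 + \sqrt{1 + \Phi_{\varsigma}(\rho_{t,f_S})} + \sqrt{1 - \Phi_{\varsigma}(\rho_{t,f_S})}\Big)g_t(x)$$
$$\le \frac{-\Phi_{\varsigma}(\rho_{t,f_S})^{2}}{4}\,g_t(x) \qquad (9)$$
$$= \frac{\partial}{\partial t}g_t(x),$$

where we use the fact that $-2 + \sqrt{1 + y} + \sqrt{1 - y} \le -y^{2}/4$ for $y \in (0, 1]$ in line
(9). Now, since $f_t(0) = g_t(0)$ and $\frac{\partial}{\partial t}f_t(x)|_{t=0} \le \frac{\partial}{\partial t}g_t(x)|_{t=0}$,

$$-2f_t(x) + f_t(x(1 + \Phi_{\varsigma}(\rho_{t,f_S}))) + f_t(x(1 - \Phi_{\varsigma}(\rho_{t,f_S})))$$
$$\le -2g_t(x) + g_t(x(1 + \Phi_{\varsigma}(\rho_{t,f_S}))) + g_t(x(1 - \Phi_{\varsigma}(\rho_{t,f_S}))),$$

and in particular, $\frac{\partial}{\partial t}f_t(x) \le \frac{\partial}{\partial t}g_t(x)$. Since $f_0(x) \le g_0(x)$, we may conclude that

$$f_t(x) \le g_t(x) = \sqrt{x}\,e^{-t\Phi_{\varsigma}(\rho_{t,f_S})^{2}/4}.$$

□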
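The elementary inequality used in line (9), $-2 + \sqrt{1 + y} + \sqrt{1 - y} \le -y^{2}/4$ for $y \in (0, 1]$, is easy to sanity-check numerically. The short Python sketch below is only a spot check on a grid of 1000 points, not a proof; the helper name `gap` and the grid are assumptions made for illustration.

```python
import math

def gap(y):
    """Left side minus right side of -2 + sqrt(1+y) + sqrt(1-y) <= -y^2/4."""
    return (-2.0 + math.sqrt(1.0 + y) + math.sqrt(1.0 - y)) - (-(y * y) / 4.0)

# Evaluate on the grid y = 0.001, 0.002, ..., 1.000; the inequality holds when gap <= 0.
worst_gap = max(gap(i / 1000.0) for i in range(1, 1001))
gap_at_one = gap(1.0)   # sqrt(2) - 2 + 1/4, the slack at the right endpoint
```

The inequality is tightest near $y = 0$, where the two sides agree up to a term of order $y^{4}$, and has the most slack at $y = 1$.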

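Using Lemma 3 and Theorem 2, we arrive at the following useful inequalities.

Corollary 1. Let $G$ be a graph and $S$ a subset with volume $\varsigma \le \mathrm{vol}(G)/4$. Then,

$$\frac{1}{2}e^{-t\Phi^{*}(S)} \le \rho_{t,f_S}(S) \le \sqrt{\varsigma}\,e^{-t\Phi_{\varsigma}(\rho_{t,f_S})^{2}/4}.$$

We are now prepared to prove Lemma 2.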
Proof (Lemma 2). Let $F$ be the set of seeds $F = \{u \in S : \rho_{t,u}(S) \le 2\rho_{t,f_S}(S)\}$.
Then, by (5),

$$F = \{u \in S : \rho_{t,u}(S) \le 2\mathbb{E}(\rho_{t,u}(S))\}.$$

Now we consider the set of vertices not included in $F$,

$$\mathbb{E}(\rho_{t,u}(S)) \ge \sum_{u \notin F}\frac{d_u}{\mathrm{vol}(S)}\rho_{t,u}(S)
> \sum_{u \notin F}\frac{d_u}{\mathrm{vol}(S)}\,2\mathbb{E}(\rho_{t,u}(S))
= \frac{\mathrm{vol}(S \setminus F)}{\mathrm{vol}(S)}\,2\mathbb{E}(\rho_{t,u}(S)).$$

Which implies

$$\frac{\mathrm{vol}(S)}{2} > \mathrm{vol}(S \setminus F) \quad \text{or,} \quad \mathrm{vol}(F) > \varsigma/2.$$

□

We can use Lemma 2 to reason that many vertices $u$ satisfy the above
inequalities, and thus can serve as good seeds for performing a sweep.

4.1 A local graph clustering algorithm 

It follows from Lemma 2 that the ranking induced by a heat kernel pagerank
vector with appropriate seed vertex can be used to find a cut with approximation
guarantee $O(\sqrt{\phi})$ by choosing the appropriate $t$. To obtain a sublinear time local
clustering algorithm for massive graphs, we use ApproxHKPRseed to efficiently
compute an $\epsilon$-approximate heat kernel pagerank vector, $\hat{\rho}_{t,u}$, to rank vertices.

The ranking induced by $\hat{\rho}_{t,u}$ is not very different from that of a true vector
$\rho_{t,u}$ in the support of $\hat{\rho}_{t,u}$ (for an experimental analysis, see Section 5). Namely,
using the bounds of Lemma 3 we have

$$\hat{\rho}_{t,u}(S) \ge (1 - \epsilon)\rho_{t,u}(S) - \epsilon s,$$

where $s = |S|$. In particular,

$$\frac{1}{2}(1 - \epsilon)e^{-t\Phi^{*}(S)} - \epsilon s \le \hat{\rho}_{t,u}(S) \le \sqrt{\varsigma}\,e^{-t\Phi_{\varsigma}(\hat{\rho}_{t,u})^{2}/4} \qquad (10)$$

for a set of vertices $u$ of volume at least $\varsigma/2$.

Theorem 3. Let $G$ be a graph and $S \subset G$ a subset with $\mathrm{vol}(S) = \varsigma \le \mathrm{vol}(G)/4$,
$|S| = s$, and Cheeger ratio $\Phi(S) \le \phi$. Let $\hat{\rho}_{t,u}$ be an $\epsilon$-approximate of $\rho_{t,u}$ for
some vertex $u \in S$. Then there is a subset $S_t \subset S$ with $\mathrm{vol}(S_t) \ge \varsigma/2$ for which a
sweep over $\hat{\rho}_{t,u}$ for any vertex $u \in S_t$ with

1. $t = \phi^{-1}\log\left(\frac{2\sqrt{\varsigma}}{1 - \epsilon} + 2\epsilon s\right)$ and

2. $\Phi_{\varsigma}(\hat{\rho}_{t,u})^{2} \le 4/t \log(2)$

finds a set with $\varsigma$-local Cheeger ratio at most $\sqrt{8\phi}$.


Proof. Let $u$ be a vertex in $S_t$ as described in the theorem statement. Using the inequalities (10), we can bound the local Cheeger ratio by a sweep over $\hat\rho_{t,u}$.


$$e^{-t\Phi^*(S)} \le \frac{2}{1-\epsilon}\left(\sqrt{\varsigma}\, e^{-t\Phi_\varsigma(\hat\rho_{t,u})^2/4} + \epsilon s\right)$$


which implies 


< c -t<i>dPt.u) 2 / 4 ^ 2 \/^ ! csc t^(pt.n) 2 /4y 


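and by the assumption 2 we have

$$e^{-t\Phi^*(S)} \le \left(\frac{2\sqrt{\varsigma}}{1-\epsilon} + 2\epsilon s\right) e^{-t\Phi_\varsigma(\hat\rho_{t,u})^2/4},$$

$$\Phi^*(S) \ge \frac{\Phi_\varsigma(\hat\rho_{t,u})^2}{4} - \frac{\log\left(\frac{2\sqrt{\varsigma}}{1-\epsilon} + 2\epsilon s\right)}{t}.$$

Let $x = \log\left(\frac{2\sqrt{\varsigma}}{1-\epsilon} + 2\epsilon s\right)$. Then,

$$\Phi_\varsigma(\hat\rho_{t,u})^2 \le 4\Phi^*(S) + 4x/t.$$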


Since $\Phi^*(S) \le \Phi(S) \le \phi$ and $t = \phi^{-1}x$, it follows that $\Phi_\varsigma(\hat\rho_{t,u}) \le \sqrt{8\phi}$. In particular, a sweep over $\hat\rho_{t,u}$ finds a cut with Cheeger ratio $O(\sqrt{\phi})$ as long as $u$ is contained in $S_t$. □


We are now prepared to give our algorithm for finding cuts locally with heat kernel pagerank. The algorithm takes as input a starting vertex $u$, a desired volume $\varsigma$ for the cut set, and a target Cheeger ratio $\phi$ for the cut set. Then, to find a set achieving a minimum $\varsigma$-local Cheeger ratio, we perform a sweep over an approximate heat kernel pagerank vector with the starting vertex as a seed.

ClusterHKPR($G$, $u$, $s$, $\varsigma$, $\phi$, $\epsilon$)

input: a graph $G$, a vertex $u$, target cluster size $s$, target cluster volume $\varsigma \le \mathrm{vol}(G)/4$, target Cheeger ratio $\phi$, error parameter $\epsilon$.

output: a set $T$ with $\varsigma/2 \le \mathrm{vol}(T) \le 2\varsigma$, $\Phi(T) \le \sqrt{8\phi}$.

 1: $t \leftarrow \phi^{-1}\log\left(\frac{2\sqrt{\varsigma}}{1-\epsilon} + 2\epsilon s\right)$
 2: $\rho \leftarrow$ ApproxHKPRseed($G$, $t$, $u$, $\epsilon$)
 3: sort the vertices of $G$ in the support of $\rho$ according to the ranking $\rho[v]/d_v$
 4: for $j \in [1, n]$ do
 5:     $S_j = \bigcup_{i \le j} v_i$
 6:     if $\mathrm{vol}(S_j) > 2\varsigma$ then
 7:         output NO CUT FOUND, break
 8:     else if $\varsigma/2 \le \mathrm{vol}(S_j) \le 2\varsigma$ and $\Phi(S_j) \le \sqrt{8\phi}$ then
 9:         output $S_j$
10:     else
11:         output NO CUT FOUND
12:     end if
13: end for
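The sweep in lines 3 to 13 can be illustrated with a short Python sketch. This is only a sketch under stated assumptions, not the authors' implementation: the graph is assumed to be a simple undirected graph stored as a dictionary of neighbor sets, the Cheeger ratio of a prefix is taken to be (boundary edges)/(volume), which presumes the prefix has at most half the total volume, and the loop keeps scanning until the volume cap is exceeded rather than reporting failure at every rejected prefix. The names `sweep_cut`, `target_vol`, and `Graph` are hypothetical.

```python
from typing import Dict, Hashable, List, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]  # assumed format: vertex -> set of neighbors


def sweep_cut(graph: Graph, rho: Dict[Hashable, float],
              target_vol: float, phi: float) -> Optional[List[Hashable]]:
    """Return the first sweep prefix passing the volume and Cheeger-ratio checks."""
    degree = {v: len(nbrs) for v, nbrs in graph.items()}
    # Rank the support of rho by degree-normalized value, largest first.
    order = sorted((v for v in rho if rho[v] > 0),
                   key=lambda v: rho[v] / degree[v], reverse=True)
    threshold = (8 * phi) ** 0.5
    prefix: Set[Hashable] = set()
    vol = 0       # sum of degrees of vertices in the prefix
    boundary = 0  # number of edges leaving the prefix
    for v in order:
        inside = sum(1 for w in graph[v] if w in prefix)
        prefix.add(v)
        vol += degree[v]
        # Edges from v into the prefix stop being boundary edges; the rest become boundary.
        boundary += degree[v] - 2 * inside
        if vol > 2 * target_vol:
            return None  # corresponds to NO CUT FOUND
        if vol >= target_vol / 2 and boundary / vol <= threshold:
            return order[:len(prefix)]
    return None
```

Maintaining the volume and boundary counts incrementally keeps the work proportional to the volume of the support of the vector, which is the property the running-time analysis relies on.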


Theorem 4. Let $G$ be a graph which contains a subset $S$ of volume at most $\mathrm{vol}(G)/4$ and Cheeger ratio bounded by $\phi$. Further, assume that $u$ is contained in the set $S_t \subset S$ as defined in Theorem 3. Then ClusterHKPR($G$, $u$, $s$, $\varsigma$, $\phi$, $\epsilon$) outputs a cutset $T$ with $\varsigma$-local Cheeger ratio at most $\sqrt{8\phi}$. The running time is the same as that of ApproxHKPRseed.

Proof. Since it is given that $u \in S_t$ for $t = \phi^{-1}\log\left(\frac{2\sqrt{\varsigma}}{1-\epsilon} + 2\epsilon s\right)$, and by the assumptions on $G$ and $S$, Theorem 3 states that a sweep over the approximate heat kernel pagerank vector $\rho$ will find a set with $\varsigma$-local Cheeger ratio at most $\sqrt{8\phi}$. The checks performed in line 8 of the algorithm discover such a set.

The computational work reduces to the main procedures of computing the heat kernel pagerank vector in line 2 and performing a sweep over the vector in line 4. Performing a sweep involves sorting the support of the vector (line 3) and calculating the conductance of segments. From the guarantees of an $\epsilon$-approximate heat kernel pagerank vector, any vertex with average probability less than $\epsilon$ will be excluded from the support. Then the volume of a vector $\rho$ output in line 2 is $O(\epsilon^{-1})$, and performing a sweep over $\rho$ can be done in $O(\epsilon^{-1}\log(\epsilon^{-1}))$ time. The algorithm is therefore dominated by the time to compute a heat kernel pagerank vector, and the total running time is $O\left(\frac{\log(\epsilon^{-1})\log n}{\epsilon^3\log\log(\epsilon^{-1})}\right)$.

□ 


5 Ranking Vertices with Approximate Heat Kernel 
Pagerank 

The backbone procedure of the local clustering algorithm is the sweep: ranking the vertices of the graph according to their approximate heat kernel pagerank values, and then testing the quality of the cluster obtained by adding vertices one at a time in the order of this ranking. To this end, in this section we compare the rankings of vertices obtained using exact heat kernel pagerank vectors with approximate heat kernel pagerank vectors. Specifically, we consider how accuracy changes as the upper bound of random walk lengths, $K$, varies.

In the following experiments, we approximate heat kernel pagerank vectors by sampling random walks of length $\min\{k, K\}$, where $k$ is chosen with probability $p_k = e^{-t}\frac{t^k}{k!}$. We test the values computed with different values of $K$. Since the expected value of a random walk length $k$ chosen with probability $p_k = e^{-t}\frac{t^k}{k!}$ is $t$, we set $K$ to range from 1 to approximately $t$ for a specified value of $t$.
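The sampling scheme just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the ApproxHKPRseed routine itself: the graph is assumed to be a dictionary mapping each vertex to a list of its neighbors, the number of walks is left as a free parameter, and the function names are hypothetical. Walk lengths are drawn from the Poisson distribution with mean $t$ by counting unit-rate exponential arrivals in $[0, t]$, then truncated at $K$.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List


def poisson_length(t: float, rng: random.Random) -> int:
    """Sample k with probability e^{-t} t^k / k! (arrivals of a unit-rate process in [0, t])."""
    k = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= t:
        k += 1
        elapsed += rng.expovariate(1.0)
    return k


def approx_hkpr(graph: Dict[Hashable, List[Hashable]], seed: Hashable, t: float,
                max_len: int, num_walks: int, rng: random.Random) -> Dict[Hashable, float]:
    """Empirical endpoint distribution of walks of length min(k, max_len) started at seed."""
    endpoints: Counter = Counter()
    for _walk in range(num_walks):
        v = seed
        for _step in range(min(poisson_length(t, rng), max_len)):
            v = rng.choice(graph[v])  # one uniform random walk step
        endpoints[v] += 1
    return {v: count / num_walks for v, count in endpoints.items()}
```

Passing an explicit `random.Random` instance makes trials reproducible, which is convenient when comparing vectors computed for different values of `max_len`.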

In each trial, for a given graph, seed vertex, and value of $t$, we compute an exact heat kernel pagerank vector and an approximate heat kernel pagerank vector $\hat\rho_{t,u}$ using ApproxHKPRseed but limiting the length of random walks to $K$ for various $K$ as described above. We then measure how similar the vectors are in two ways. First, we compare the vector values computed. Second, we compare the rankings obtained with each vector. The following are the measures used:

1. Comparing vector values. We measure the error of the approximate vector $\hat\rho_{t,u}$ by examining the values computed for each vertex and comparing to an exact vector $\rho_{t,u}$. We use the following measures:

   - Average $L_1$ error: The average absolute error over all vertices of the graph,

     $$\text{average } L_1 \text{ error} := \frac{1}{n}\sum_{i=1}^{n} |\rho_{t,u}(v_i) - \hat\rho_{t,u}(v_i)|. \tag{11}$$

   - $\epsilon$-error: The accumulated error in excess of an $\epsilon$-approximation (see Definition 1),

     $$\epsilon\text{-error} := \sum_{v \in V,\ \rho_{t,u}(v) > 0} \max\{|\rho_{t,u}(v) - \hat\rho_{t,u}(v)| - \epsilon\rho_{t,u}(v),\ 0\} + \sum_{v \in V,\ \rho_{t,u}(v) = 0} \max\{\hat\rho_{t,u}(v) - \epsilon,\ 0\}. \tag{12}$$


2. Comparing vector rankings. To measure the similarity of vertex rankings we use the intersection difference (see among others). For a ranked list of vertices $A$, let $A_i$ be the set of items with the top $i$ rankings. Then we use the following measures:

   - Intersection difference: Given two ranked lists of vertices, $A$ and $B$, each of length $n$, the intersection difference is

     $$\text{intersection difference} := \mathrm{dist}(A, B) = \frac{1}{n}\sum_{i=1}^{n} \frac{|A_i \oplus B_i|}{2i}, \tag{13}$$

     where $A_i \oplus B_i$ denotes the symmetric difference $(A_i \setminus B_i) \cup (B_i \setminus A_i)$.

   - Top-$k$ intersection difference: The intersection difference among the top $k$ elements in each ranking,

     $$\text{top-}k\text{ intersection difference} := \frac{1}{k}\sum_{i=1}^{k} \frac{|A_i \oplus B_i|}{2i}. \tag{14}$$

   Intersection difference values lie in the range $[0, 1]$, where a difference of 0 is achieved for identical rankings, and 1 for totally disjoint lists. In these experiments, $A$ is the list of vertices ranked according to an exact heat kernel pagerank vector $\rho_{t,u}$, and $B$ is the list of vertices ranked according to an $\epsilon$-approximate heat kernel pagerank vector $\hat\rho_{t,u}$.
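The four measures above are simple to compute once both vectors are available. The following Python sketch shows one way to do so; it assumes vectors are stored as dictionaries from vertex to value (missing keys meaning zero) and rankings as lists ordered from best to worst. The function names are hypothetical, and the top-$k$ variant assumes the same normalization as the full measure with $n$ replaced by $k$.

```python
from typing import Dict, Hashable, Optional, Sequence


def average_l1_error(exact: Dict[Hashable, float], approx: Dict[Hashable, float],
                     vertices: Sequence[Hashable]) -> float:
    """Mean absolute difference over all vertices, as in measure (11)."""
    return sum(abs(exact.get(v, 0.0) - approx.get(v, 0.0)) for v in vertices) / len(vertices)


def epsilon_error(exact: Dict[Hashable, float], approx: Dict[Hashable, float],
                  vertices: Sequence[Hashable], eps: float) -> float:
    """Accumulated error in excess of an eps-approximation, as in measure (12)."""
    total = 0.0
    for v in vertices:
        p, q = exact.get(v, 0.0), approx.get(v, 0.0)
        if p > 0:
            total += max(abs(p - q) - eps * p, 0.0)
        else:
            total += max(q - eps, 0.0)
    return total


def intersection_difference(a: Sequence[Hashable], b: Sequence[Hashable],
                            k: Optional[int] = None) -> float:
    """Average of |A_i symmetric-difference B_i| / (2i) over the first k prefixes (all if k is None)."""
    n = len(a) if k is None else k
    top_a, top_b = set(), set()
    total = 0.0
    for i in range(1, n + 1):
        top_a.add(a[i - 1])
        top_b.add(b[i - 1])
        total += len(top_a ^ top_b) / (2 * i)
    return total / n
```

Growing the two prefix sets incrementally avoids rebuilding them for every $i$, so a full comparison needs only one pass over the rankings plus the cost of the set differences.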

In every trial we choose $t = \phi^{-1}\log\left(\frac{2\sqrt{\varsigma}}{1-\epsilon} + 2\epsilon s\right)$ as specified in the local clustering algorithm stated in Section 4.1. This value depends on several parameters, including desired Cheeger ratio, cluster size, and cluster volume. Specifics are provided for each set of trials.
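As a worked illustration of how these parameters combine, the sketch below evaluates $t$ and the random walk length bound $K$. It assumes the reading $t = \phi^{-1}\log\left(\frac{2\sqrt{\varsigma}}{1-\epsilon} + 2\epsilon s\right)$ with natural logarithms and $K = 4\log(\epsilon^{-1})/\log\log(\epsilon^{-1})$; these readings are an assumption, chosen because they reproduce the values 84.9 (Table 3) and 11.04 (Section 5.1) reported in the text. The function names are hypothetical.

```python
import math


def heat_parameter(phi: float, varsigma: float, s: float, eps: float) -> float:
    """t = (1/phi) * log(2*sqrt(varsigma)/(1 - eps) + 2*eps*s), natural logarithm assumed."""
    return math.log(2 * math.sqrt(varsigma) / (1 - eps) + 2 * eps * s) / phi


def walk_length_bound(eps: float, c: float = 4.0) -> float:
    """K = c * log(1/eps) / log(log(1/eps)); the constant c = 4 follows the text."""
    return c * math.log(1 / eps) / math.log(math.log(1 / eps))


# Parameters of the synthetic trials: phi = 0.05, varsigma = 500, s = 100, eps = 0.1.
t = heat_parameter(0.05, 500, 100, 0.1)
K = walk_length_bound(0.1)
```

Note that $K$ depends only on $\epsilon$, while $t$ grows as the target Cheeger ratio $\phi$ shrinks.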

5.1 Synthetic graphs 

Random graph models In this series of trials we use three different models of random graph generation provided in the NetworkX Python package, which we describe presently.

The first is the Watts-Strogatz small world model [41], generated with the command connected_watts_strogatz_graph in NetworkX. In this model, a ring of $n$ vertices is created and then each vertex is connected to its $d$ nearest neighbors. Then, with probability $p$, each edge $(u, v)$ in the original construction is replaced by the edge $(u, w)$ for a random existing vertex $w$. The model takes parameters $n, d, p$ as input.

The second is the preferential attachment (Barabasi-Albert) model [5]. Graphs in this model are created by adding $n$ vertices one at a time, where each new vertex is adjoined to $d$ edges where each edge is chosen with probability proportional to the degree of the neighboring vertex. This is generated with the barabasi_albert_graph generator in NetworkX. The model takes parameters $n, d$ as input.

The third NetworkX generator is powerlaw_cluster_graph, which uses the Holme and Kim algorithm for generating graphs with powerlaw degree distribution and approximate average clustering [19]. It is essentially the Barabasi-Albert model, but each random edge forms a triangle with another neighbor with probability $p$. The model takes parameters $n, d, p$ as input.

Table 2 lists the random graph models used and their parameters.

Model                     Source                 Parameters
small world               Watts-Strogatz [41]    n, the size of the vertex set,
                                                 d, the number of neighbors each vertex is assigned,
                                                 p, the probability of switching an edge.
preferential attachment   Barabasi-Albert [5]    n, the size of the vertex set,
                                                 d, the number of neighbors each vertex is assigned
powerlaw cluster          Holme and Kim [19]     n, the size of the vertex set,
                                                 d, the number of neighbors each vertex is assigned,
                                                 p, the probability of forming a triangle

Table 2: Random graph models used.

Procedure For every value of $K$ that we test, we generate ten random graphs using each of the three random graph models. For each graph we choose a random seed vertex $u$ with probability proportional to degree, and we choose $t$ as $t = \phi^{-1}\log\left(\frac{2\sqrt{\varsigma}}{1-\epsilon} + 2\epsilon s\right)$ according to the values in Table 3. Then for each graph we compare an exact heat kernel pagerank vector $\rho_{t,u}$ and the average of two $\epsilon$-approximate heat kernel pagerank vectors $\hat\rho_{t,u}$. The results we present are the average over all trials for each $K$ and each type of graph. We use $d = 5$ and $p = 0.1$ in every trial, and $n = 100$ for the first set of trials (Figure 1) and $n = 500$ for the second (Figure 2). These parameters are outlined in Table 3.
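Choosing a seed vertex with probability proportional to degree, as in the procedure above, can be done with the Python standard library alone. The sketch below is illustrative only: it assumes the graph is a dictionary mapping each vertex to a list of its neighbors, and the function name is hypothetical.

```python
import random
from typing import Dict, Hashable, List


def degree_weighted_seed(graph: Dict[Hashable, List[Hashable]],
                         rng: random.Random) -> Hashable:
    """Pick a vertex with probability d_v / vol(G)."""
    vertices = list(graph)
    weights = [len(graph[v]) for v in vertices]  # vertex degrees
    return rng.choices(vertices, weights=weights, k=1)[0]
```

This is the same distribution as the stationary distribution of the random walk, so high-degree vertices are selected as seeds more often.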


Model                     |V|   d   p     $\epsilon$   Target Cheeger ratio   Target cluster size   Target cluster volume   t
small world               100   5   0.1   0.1          $\phi$ = 0.05          s = 100               $\varsigma$ = 500       84.9
                          500   5   0.1   0.1          $\phi$ = 0.05          s = 100               $\varsigma$ = 500       84.9
preferential attachment   100   5   -     0.1          $\phi$ = 0.05          s = 100               $\varsigma$ = 500       84.9
                          500   5   -     0.1          $\phi$ = 0.05          s = 100               $\varsigma$ = 500       84.9
powerlaw cluster          100   5   0.1   0.1          $\phi$ = 0.05          s = 100               $\varsigma$ = 500       84.9
                          500   5   0.1   0.1          $\phi$ = 0.05          s = 100               $\varsigma$ = 500       84.9

Table 3: Parameters used for random graph generation and to compute t for vector computations.


Discussion For each graph and value of $K$, we measure the $\epsilon$-error, the average $L_1$ error, the intersection difference and the top-10 intersection difference of an approximate heat kernel pagerank vector as compared to an exact heat kernel pagerank vector. Figure 1 plots the above measures for graphs over $n = 100$ vertices, while Figure 2 plots these measures for graphs over $n = 500$ vertices. In both Figures 1 and 2, each subplot charts a different notion of error (from top left, clockwise: $\epsilon$-error, average $L_1$ error, intersection difference and top-10 intersection difference) on the y-axis against $K$ on the x-axis.

In both sets of plots and for every measure of error, we see that in the preferential attachment and powerlaw graphs the error is minimized after limiting random walks to only length $K = 10$, regardless of the size. We observe a shallower decline in $\epsilon$-error, average $L_1$ error, and intersection difference for the small world graphs. In particular, we note that the intersection difference drops significantly after 10 random walk steps for all random graphs on both 100 and 500 vertices. For $\epsilon = 0.1$, $K = 4 \cdot \frac{\log(\epsilon^{-1})}{\log\log(\epsilon^{-1})} \approx 11.04$ is enough to approximate the rankings for the purpose of local clustering.

Trials on 100-vertex random graphs.

Fig. 1: Different measures of error for random graphs on 100 vertices when approximating heat kernel pagerank with varying random walk lengths.

Trials on 500-vertex random graphs.

Fig. 2: Different measures of error for random graphs on 500 vertices when approximating heat kernel pagerank with varying random walk lengths.


5.2 Real graphs 

Network data For the experiments in this section, and later in Section 6, we use the following graphs compiled from real data. The network data is summarized in Table 4.

1. (dolphins) A dolphin social network consisting of two families [32]. The seed vertex is chosen to be a prominent member of one of the families.

2. (polbooks) A network of books about US politics published around the time of the 2004 Presidential election and sold on Amazon [23]. Edges represent frequent copurchases.

3. (power) The topology of the US Western States Power Grid [40].

4. (facebook) A combined collection of Facebook ego-networks, including the ego vertices themselves [26].

5. (enron) An Enron email communication network [21], in which vertices represent email addresses and an edge $(i, j)$ exists if an address $i$ sent at least one email to address $j$.


Network    Source                                |V|     |E|      Average degree
dolphins   Dolphins social network [32]          62      159      5
polbooks   Copurchases of political books [23]   105     441      8.8
power      Power grid topology [40]              4941    6594     2.7
facebook   Facebook ego-networks [26]            4039    88234    43.7
enron      Enron communication network [21]      36692   183831   10

Table 4: Graphs compiled from real data.


The network data for graphs 1, 2, and 3 were taken from Mark Newman's network data collection [33]. The network data for graphs 4 and 5 are from the SNAP Large Network Dataset Collection [24].


Procedure In this series of experiments, the seed vertex $u$ was chosen to be a known member of a cluster. As before, $t$ was chosen according to $t = \phi^{-1}\log\left(\frac{2\sqrt{\varsigma}}{1-\epsilon} + 2\epsilon s\right)$ with the values in Table 5. For each graph and for each value of $K$ we compare an exact heat kernel pagerank vector $\rho_{t,u}$ with an $\epsilon$-approximate heat kernel pagerank vector $\hat\rho_{t,u}$. Specifically, we consider the average $L_1$ distance (11) and the intersection difference (13). We again choose $K$ to range from 1 to $t$.

$\epsilon$   Target Cheeger ratio   Target cluster size   Target cluster volume   t
0.1          $\phi$ = 0.05          s = 100               $\varsigma$ = 1000      95.6

Table 5: Parameters used to compute t for vector computations.


Discussion Figure 3 plots the average $L_1$ error on the y-axis against different values of $K$ on the x-axis for each of the dolphins, polbooks, and power graphs. Figure 4 plots the intersection difference on the y-axis against $K$ on the x-axis.



Fig. 3: Average error in each component for e-approximate heat kernel pagerank 
vectors when allowing varying random walk lengths. 


First we discuss the average $L_1$ error. The dolphins and the polbooks graphs exhibit properties of both the small world graphs and the preferential attachment graphs of the previous section (Figures 1 and 2). Like the preferential attachment models, there is a significant drop in average $L_1$ error after $K = 5$, and like the small world model the error continues to drop for larger values of $K$, approaching a minimum error of $\approx 0.003$. The average $L_1$ error in the power graph, on the other hand, is small for all values of $K$. We remark that, representing a power grid, the graph has very small average vertex degree, so few random walk steps are enough to approximate the stationary distribution.

As for the intersection difference, we observe a smaller variance in values for the three graphs. Regardless of the size or structure of the graph, the intersection difference drops sharply from $K = 1$ to $K = 5$. For values larger than $K = 10 < 4 \cdot \frac{\log(\epsilon^{-1})}{\log\log(\epsilon^{-1})}$, where $\epsilon = 0.1$, the intersection difference decreases only marginally.

Fig. 4: Intersection difference of the ranked lists of vertices computed by exact and $\epsilon$-approximate heat kernel pagerank vectors when allowing varying random walk lengths.

The purpose of these experiments was to evaluate how error and differences of ranking change in heat kernel pagerank approximation when varying $K$, the upper bound on number of steps taken in random walks. We found that setting an upper bound for random walk lengths to $K = 10 < 4 \cdot \frac{\log(\epsilon^{-1})}{\log\log(\epsilon^{-1})}$ with $\epsilon = 0.1$, according to Theorem 1, yields approximations which satisfy the prescribed error bounds. This value is independent of the size of the graph and $t$, and depends only on $\epsilon$. Namely, we observed that choosing $K$ this way results in a significant decrease in both average $L_1$ error and intersection difference as compared to smaller values of $K$, and only slight decrease in average $L_1$ error and intersection difference for larger values of $K$, as demonstrated in Figures 1, 2, 3, and 4. Further, we tested graphs of various size, random graphs generated from various models (see Section 5.1), and graphs from real data representing social networks, copurchasing networks, and topological grids (see Section 5.2). We found this choice of $K$ was optimal for every graph regardless of size or structure. That is, the cutoff for random walk lengths does not depend on the size of the graph.

It is also worth mentioning that the most striking outlier among the subject 
graphs is the small world graph, or expander graphs. This is due to the fact 
that the graph consists of a single cluster, which makes local cluster detection 
ineffective. 


6 An Assessment of Cheeger Ratios Obtained with Local 
Clustering Algorithms 

The goal of this section is to analyze the quality of local clusters computed with a sweep over an approximate heat kernel pagerank vector (see Section 4 for details on sweeps). We consider two objectives for analysis.

The first objective is to validate the statement of Theorem 4. To do this, we show that the Cheeger ratios of local clusters computed with sweeps over approximate heat kernel pagerank vectors are within the approximation guarantees of Theorem 4. We use a slightly modified version of ClusterHKPR to compute these clusters. We call this modified algorithm eHKPR, and it is described in the list below.

The second objective is to compare clusters computed with sweeps over dif¬ 
ferent vectors. Namely, for a given graph and seed vertex, we compare the local 
clusters computed using the following sweep algorithms: 


1. (eHKPR) A sweep over an $\epsilon$-approximate heat kernel pagerank vector is performed. The segment $S$ with volume $\mathrm{vol}(S) \le \mathrm{vol}(G)/2$ of minimal Cheeger ratio is output. This is the ClusterHKPR algorithm with the following modification: we allow segments of volume up to $\mathrm{vol}(G)/2$ rather than limiting the search to segments of volume $\le 2\varsigma$, twice the target volume.

2. (HKPR) A sweep over an exact heat kernel pagerank vector is performed. The segment $S$ with volume $\mathrm{vol}(S) \le \mathrm{vol}(G)/2$ of minimal Cheeger ratio is output. This algorithm was outlined, but not stated explicitly, in [10].

3. (PR) A sweep over a Personalized PageRank vector is performed. The segment $S$ with volume $\mathrm{vol}(S) \le \mathrm{vol}(G)/2$ of minimal Cheeger ratio is output. This is an adaptation of the algorithm PageRank-Nibble[S] with the following modifications: (i) rather than performing a sweep over an approximate PageRank vector, perform a sweep over an exact PageRank vector, and (ii) allow segments only as large as $\mathrm{vol}(G)/2$.


We summarize the algorithms and parameters below in Table 6.


Algorithm   Sweep vector                 Algorithm parameters                  Sweep vector parameters
eHKPR       $\hat\rho_{t,u}$             $\phi$, target Cheeger ratio          $t = \phi^{-1}\log\left(\frac{2\sqrt{\varsigma}}{1-\epsilon} + 2\epsilon s\right)$
                                         $s$, target cluster size              $u$, seed vertex
                                         $\varsigma$, target cluster volume    $\epsilon$, approximation parameter
HKPR        $\rho_{t,u}$                 $\phi$, target Cheeger ratio          $t = \log s$
                                         $s$, target cluster size              $u$, seed vertex
PR          $\mathrm{pr}_{\alpha,u}$     $\phi$, target Cheeger ratio          $\alpha = \frac{\phi^2}{255\ln(100\sqrt{m})}$
                                                                               $u$, seed vertex

Table 6: Algorithms used for comparing local clusters.


Each trial will resemble Procedure 1, as stated below.

Procedure 1 Compare Clusters

Let $G$ be a graph and $u$ a seed vertex
Choose parameters $\phi$, $s$, $\varsigma$, $\epsilon$
Let $S_A$ be a local cluster computed using the algorithm eHKPR
Let $S_B$ be a local cluster computed using the algorithm HKPR
Let $S_C$ be a local cluster computed using the algorithm PR
Compare $S_A$, $S_B$, $S_C$.


The following sections describe the experiments in more detail. 


6.1 Synthetic graphs 

In this section, we use graphs generated with three random graph models: Watts-Strogatz small world, Barabasi-Albert preferential attachment, and Holme and Kim's powerlaw cluster as described in Section 5.1.


Procedure We perform twenty-five trials of Procedure 1 and take the averages of Cheeger ratios and cluster volumes computed. Specifically, we fix a model and algorithm parameters. Then, we generate a random graph according to the model and parameters. For each random graph, we pick a random seed vertex with probability proportional to degree. Then, for each seed vertex we compute local clusters $S_A, S_B, S_C$ using the algorithms eHKPR, HKPR, and PR, respectively. We then use the average Cheeger ratio and cluster volume of $S_A, S_B, S_C$ for comparison. In Table 7 we summarize the parameters used for each random graph model.


Model                     |V|    d   p     $\epsilon$   Target Cheeger ratio   Target cluster size   Target cluster volume
small world               100    5   0.1   0.1          0.1                    20                    100
                          500    5   0.1   0.1          0.1                    100                   500
                          800    5   0.1   0.1          0.1                    100                   500
                          1000   5   0.1   0.1          0.1                    100                   500
preferential attachment   100    5   -     0.1          0.1                    20                    100
                          500    5   -     0.1          0.1                    100                   500
                          800    5   -     0.1          0.1                    100                   500
powerlaw cluster          100    5   0.1   0.1          0.1                    20                    100
                          500    5   0.1   0.1          0.1                    100                   500
                          800    5   0.1   0.1          0.1                    100                   500

Table 7: Algorithm parameters used to compare local clusters.

Discussion We address the first analytic objective listed in the introduction of this section by discussing the clusters output by eHKPR. Namely, we compare the clusters computed with eHKPR to the guarantees of Theorem 4. The results for each graph are given in Table 8. The first three columns indicate the random graph model and algorithm parameters used for each instance. The last two columns demonstrate how the (average) Cheeger ratio of clusters computed by eHKPR compares to the approximation guarantee of Theorem 4. Namely, Theorem 4 states that the cluster output will have Cheeger ratio $\le \sqrt{8\phi}$ with high probability. In every case the Cheeger ratio is well within the approximation bounds.
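The comparison against the guarantee is a one-line computation. The sketch below, with hypothetical function names, evaluates the bound $\sqrt{8\phi}$ and checks an observed Cheeger ratio against it; for the target $\phi = 0.1$ used in the synthetic trials the bound evaluates to about 0.894427, the value listed alongside each measured ratio.

```python
import math


def cheeger_guarantee(phi: float) -> float:
    """Upper bound sqrt(8 * phi) on the Cheeger ratio of the output cluster."""
    return math.sqrt(8 * phi)


def within_guarantee(observed_ratio: float, phi: float) -> bool:
    """True when an observed Cheeger ratio satisfies the sqrt(8 * phi) bound."""
    return observed_ratio <= cheeger_guarantee(phi)
```

For example, the largest average ratio reported for the small world graphs, 0.519399, satisfies `within_guarantee(0.519399, 0.1)`.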


Synthetic graphs

Model                     |V|    $\phi$, Target Cheeger ratio   Cheeger ratio output by eHKPR   $\sqrt{8\phi}$
small world               100    0.1                            0.173557                        0.894427
                          500    0.1                            0.47316                         0.894427
                          800    0.1                            0.510597                        0.894427
                          1000   0.1                            0.519399                        0.894427
preferential attachment   100    0.1                            0.523929                        0.894427
                          500    0.1                            0.503542                        0.894427
                          800    0.1                            0.491046                        0.894427
powerlaw cluster          100    0.1                            0.517521                        0.894427
                          500    0.1                            0.500312                        0.894427
                          800    0.1                            0.494145                        0.894427

Table 8: Cheeger ratios of cluster output by eHKPR.


The second objective is to compare clusters computed with the three different local clustering algorithms eHKPR, HKPR, and PR. Table 9 is a collection of cluster statistics for the trials. For each graph instance we list the average Cheeger ratio and cluster volume of local clusters computed using the PR, HKPR, and eHKPR algorithms, respectively.

We remark that for each graph there is little variation in Cheeger ratio and volume of clusters computed by the three different algorithms. We also note that there is no obvious trend as graphs get larger. The small world graphs demonstrate the greatest variation in cluster quality. However, as mentioned in Section 5, expander graphs, such as small world graphs, consist of one large cluster.

It is worth noting that in some trials the output volume is significantly greater than twice the target volume. While this may seem like a contradiction, it is a consequence of our implementation. During a sweep one may choose to output a cluster of minimal Cheeger ratio, or one that satisfies volume constraints, or both. We are interested in comparing Cheeger ratios and so allow the sweep to continue checking clusters that are well beyond twice the target volume.

Synthetic graphs

Model                     |V|    PR                             HKPR                           eHKPR
small world               100    0.235159 (volume = 52.52)      0.087723 (volume = 171.28)     0.173557 (volume = 142)
                          500    0.244261 (volume = 190.16)     0.062263 (volume = 943.68)     0.47316 (volume = 206.64)
                          800    0.246564 (volume = 162.68)     0.064599 (volume = 1413.6)     0.510597 (volume = 209.6)
                          1000   0.245612 (volume = 584.56)     0.064716 (volume = 1907.4)     0.519399 (volume = 225.2)
preferential attachment   100    0.430071 (volume = 471.2)      0.512819 (volume = 467.16)     0.523929 (volume = 468.16)
                          500    0.508305 (volume = 2461.96)    0.51018 (volume = 2459.4)      0.503542 (volume = 2463.28)
                          800    0.491046 (volume = 3964.17)    0.496369 (volume = 3971.17)    0.491046 (volume = 3951.83)
powerlaw cluster          100    0.426828 (volume = 463.4)      0.505277 (volume = 465.44)     0.517521 (volume = 464.88)
                          500    0.487341 (volume = 2447.12)    0.507328 (volume = 2460.44)    0.500312 (volume = 2446.28)
                          800    0.522281 (volume = 3947)       0.513365 (volume = 3966)       0.494145 (volume = 3947)

Table 9: Cheeger ratios of clusters output by different local clustering algorithms on synthetic data.

6.2 Real graphs

For these trials we use graphs generated from real data summarized in Section 5.2.


Procedure We compare clusters computed by each of the three algorithms as outlined in Procedure 1. In these trials we fix the seed vertex to be a member of a cluster with good Cheeger ratio. Using this seed vertex, we compare the clusters computed using the eHKPR, HKPR, and PR algorithms.

For each trial we use the parameters listed in Table 10. We note that in each case the target cluster volume is computed to be roughly the target cluster size times the average vertex degree, and here we use $\epsilon = 0.1$.


Network    |V|     |E|      Average degree   $\epsilon$   Target Cheeger ratio   Target cluster size   Target cluster volume
dolphins   62      159      5                0.1          0.08                   20                    100
polbooks   105     441      8.8              0.1          0.05                   30                    270
power      4941    6594     2.7              0.1          0.05                   200                   600
facebook   4039    88234    43.7             0.1          0.05                   200                   2800
enron      36692   183831   10               0.1          0.05                   100                   1000

Table 10: Graph and algorithm parameters used to compare local clusters.


Discussion Table 11 lists ratios output by eHKPR compared with the approximation guarantees of Theorem 4. In each case, the Cheeger ratios are well within the approximation bounds of Theorem 4.


Real graphs

Network    $\phi$, Target Cheeger ratio   Cheeger ratio output by eHKPR   $\sqrt{8\phi}$
dolphins   0.08                           0.083333                        0.8
polbooks   0.05                           0.052133                        0.632456
power      0.05                           0.346667                        0.632456
facebook   0.05                           0.056939                        0.632456
enron      0.05                           0.036602                        0.632456

Table 11: Cheeger ratios of cluster output by ClusterHKPR.


The complete numerical data obtained from the set of the trials are given in Table 12. For each graph we list the Cheeger ratio, cluster volume, and additionally the cluster size of local clusters computed using each of the algorithms PR, HKPR, and eHKPR, respectively.

Real graphs

Network    PR                                        HKPR                                      eHKPR
dolphins   0.226415 (volume = 106) (size = 23)       0.163636 (volume = 110) (size = 24)       0.083333 (volume = 96) (size = 20)
polbooks   0.079518 (volume = 415) (size = 48)       0.245657 (volume = 403) (size = 49)       0.052133 (volume = 422) (size = 50)
power      0.375 (volume = 16) (size = 6)            0.002764 (volume = 4342) (size = 1564)    0.346668 (volume = 300) (size = 85)
facebook   0.418993 (volume = 88140) (size = 3063)   0.001277 (volume = 67326) (size = 1094)   0.056939 (volume = 35266) (size = 258)
enron      0.48797 (volume = 183612)                 -                                         0.036602 (volume = 3579)

Table 12: Cheeger ratios of cluster output by different local clustering algorithms.


For each graph, the local cluster computed using eHKPR has smaller Cheeger 
ratio than the local cluster computed using PR. For the power graph, we ob¬ 
serve that the cluster of minimal Cheeger ratio was computed using the HKPR 
algorithm, but it is nearly a third the size of the entire network. The algorithms 
eHKPR and PR, on the other hand, each return smaller clusters. We remark 
that for real graphs, the clusters computed using sweeps over different vectors 
have more variation than for random graphs. 

At this point we remark about our choice of parameters for the trials. The sensitivity of the algorithm to the choice of $\epsilon$, $\phi$, $s$, and $\varsigma$ has not yet been fully explored. In particular, it is worth studying the effect of $\phi$ on the output cluster in future work.

To conclude, we include visualizations of clusters computed in the facebook ego-network to illustrate the differences in local cluster detection. Figure 5 colors the vertices in a local cluster computed using the eHKPR algorithm, as described in Table 12. Figure 6 colors the vertices in a local cluster computed using the PR algorithm.

The numerical data of the last two sections validate the effectiveness and efficiency of local cluster detection using sweeps over $\epsilon$-approximate heat kernel pagerank. The experiments of Section 5 demonstrate that sampling a number of random walks of at most $K$ steps yields a ranking of vertices within the error bounds of Theorem 1. This ranking in turn is used to compute a local cluster. What is more, this value $K$ does not depend on parameters other than $\epsilon$. Specifically, it does not depend on the size of the graph or the desired cluster volume, size, or Cheeger ratio. Finally, the data of Section 6 validate the statements of Theorem 4. That is, performing a sweep over an approximate heat kernel pagerank vector detects clusters of Cheeger ratio at most $\sqrt{8\phi}$ for a desired Cheeger ratio $\phi$. The total cost of computing this cluster is $O\left(\frac{\log(\epsilon^{-1})\log n}{\epsilon^3\log\log(\epsilon^{-1})}\right)$, sublinear in the size of the graph.

Fig. 5: Local cluster in facebook ego network computed using the eHKPR algorithm.

Fig. 6: Local cluster in facebook ego network computed using the PR algorithm.